Construct Equivalence of a National Certification Examination 
That Uses Dual Languages and Audio Assistant 

Objective 

The purposes of this study are to (a) investigate the factorial (structural) validity of a national 
certification examination; (b) assess the construct equivalence of a national certification 
examination across different languages with or without audio assistant; and (c) provide an 
example of how to extend validity evidence beyond the methodology typically used in 
certification testing. 

Perspective 

Certification tests are designed to assess professional competence. Like other credentialing 
tools, certification tests are intended to help the public, employers, and government agencies 
identify practitioners who have met a particular standard. Certification organizations have a 
responsibility not only to candidates, to ensure that all certification procedures are fair and 
consistent, but also to the consumer, to ensure the validity of the certification process so that 
individuals who are certified are indeed competent. Like any high-stakes tests, certification tests 
must satisfy the legal requirements of validation and fairness. Validity (and fairness), according 
to the Standards for Educational and Psychological Testing (AERA, APA, NCME, 1999), is the 
most important consideration in test development and evaluation. Validity and fairness of the 
certification tests are also required by federal laws and regulations (Equal Employment 
Opportunity Commission [EEOC], Civil Service Commission, Department of Labor and 
Department of Justice, 1978; Mehrens, 1994; Mehrens & Popham, 1992). The degree to which 
different language tests (with/without audio assistant) are comparable is an important validity 
issue because these tests are typically interpreted as if they are equivalent or use the same cut 
score to determine the pass/fail status of examinees. The use of multi-language tests will 
continue to grow because of increasing globalization of markets, linguistic diversity, and 
cultural exchange. As with achievement tests (Gierl, 2001; Hambleton, 2001; Sireci & Khaliq, 
2002), many certification tests are adapted for different languages in order to reduce 
construct-irrelevant variance in candidates' test scores due to proficiency in a specific language. 
Comparing the competence of candidates who take tests in different languages is not easy 
because score differences between language versions could be due either to competence 
differences or to psychometric differences (Angoff & Cook, 1998; Geisinger, 1994; 
Hambleton, 1993, 1994; Prieto, 1992; Sireci, 1997). According to the Guidelines (ITC, 2002) 
and Gierl (2001), the comparability of the constructs measured by dual-language tests must be 
established before any attempt to link the tests or to apply IRT methods, such as IRT equating 
or differential item functioning (DIF) analysis, because IRT models rest on assumptions about 
the latent trait. Although DIF analyses can identify problematic items at the item level, the 
factor structure of the tests can only be evaluated at the total test score level (Sireci, 1997; van 
de Vijver & Tanzer, 1998). 

For a comparison between tests using different languages to be meaningful, the construct 
measured by the tests must be equivalent (Gierl, Rogers, & Klinger, 1999; Hambleton, 1994; 
Hulin, 1987; van de Vijver & Hambleton, 1996; van de Vijver & Poortinga, 1997). The 
construct underlying a test is a theoretical representation of the underlying trait, concept, 
attribute, process, or structure the test is designed to measure (Cronbach, 1971; Messick, 1989). 
Construct equivalence is achieved if the same construct is measured across different 
groups and across different factors such as language and administration. 

The National Nurse Aide Assessment Program (NNAAP) includes written examination 
forms that use different languages (English and Spanish) under different administration 
conditions (with or without audio assistant). Therefore, this study was undertaken to evaluate 
test validity and fairness across the various examination modes. 

In seeking evidence of test validity and fairness, the research should address questions such 
as whether the test measures the same construct for all relevant populations. The difference in 
test scores of the NNAAP among examinees who use a different language with or without audio 
assistant could be due to language differences, administration condition (audio assist) 
differences, or true competence difference. The additional administration mode option makes 
comparisons between the NNAAP forms even more difficult than a comparison of different 
languages. Although many studies using structural equation modeling (SEM), scaling, and 
exploratory factor analysis have been conducted to assess the structural equivalence of tests 
across language and cultural groups (Gierl, 1999; Reise, Widaman, & Pugh, 1993; Robin, Sireci, 
& Hambleton, 2000; Sireci & Allalouf, in press), none of the SEM studies has evaluated 
equivalence across languages under different administration conditions. 
This study investigates the structure of the certification test for the National Nurse Aide 
Assessment Program with regard to psychometric equivalence across dual languages and 
administration conditions. 

Method and Data 

Instrument 

The National Nurse Aide Assessment Program (NNAAP) is a nationally administered certifying 
examination program that is based on the activities and knowledge required for competent 
performance by nurse aides in long-term care, acute care, and home health care settings. The 
NNAAP consists of two components: the first is a knowledge test that is referred to as the 
written examination, and the second one is a skill demonstration that is called a skill evaluation. 
This study focuses on the written exam only. 

The written exam forms are created according to a content outline based on the results of a 
job analysis conducted by the National Council of State Boards of Nursing (1995). The job 
analysis identified the most important activities performed by nurse aides across all settings and 
the knowledge required for performing each activity. Using the job analysis results, the subject 
matter experts developed the content outline and assigned proportionate weightings to each 
content area. Three major content areas are defined: (I) Physical care skills (47%), (II) 
Psychosocial care skills (22%), and (III) Role of the nurse aide (31%). Each of the major content 
areas includes several subcontent areas. Physical care skills include activities of daily living, 
basic nursing skills, and restorative skills. Psychosocial care skills include emotional and 
mental health needs, and spiritual and cultural needs. Role of the nurse aide includes 
communication, client rights, legal and ethical behavior, and being a member of the health care 
team. Each written exam form consists of 70 multiple-choice items. Sixty are used to determine 
a candidate's test score, and the remaining ten are pre-test items. 

There are three formats for the NNAAP written exam. The standard format consists of test 
items written in English. For candidates with limited reading proficiency, the written English 
test items are used and administered together with cassette tapes that present directions and test 
questions orally. For Spanish speakers, the English version is translated into Spanish and 
administered with Spanish language tapes in the same mode as the English language oral 
administration. 

Sample 

From 1998 to 2000, a total of 273,492 nurse aide candidates from 31 states took the 
national nurse aide examination. A sample of 20,568 candidate test responses to one 
examination form created in 1998 was used in this study. These candidates were selected from four 
geographically representative states: Colorado, Florida, New Jersey, and Texas. Among the 
candidates, 10,908 (53%) took the English version of the exam form (E), 6,140 (30%) took the 
English with audio-tape version (EA), and 3,520 (17%) took the Spanish with audio-tape version 
(SA) of the same exam form. Table 1 shows background characteristics of the NNAAP test for 
the different examinee groups. 

Table 1 

Background Characteristics of the NNAAP Test for Different Groups 

                              Written        Written English      Written Spanish
Background Characteristics    English (E)    + Audio Tape (EA)    + Audio Tape (SA)

Sample Size                   10908          6140                 3520
Raw Score Mean                51.79          47.38                48.48
Raw Score SD                  5.89           6.66                 5.53
Reliability (KR-20)           0.84           0.79                 0.76


Data Analyses 

A series of structural equation modeling (SEM) procedures was conducted for this 
structural invariance study. For the purpose of cross-validation, subjects were randomly split into 
a calibration sample and a validation sample (Byrne, 2001). One purpose of the 
cross-validation strategy is to assess the reliability of model fit. Having chosen 
the SEM model that best fits a particular sample of data, one may not automatically assume that 
the model can be reliably applied to other samples from the same population. However, 
if a model that fits the calibration sample well also fits the 
validation sample, a different sample from the same population of interest, then we may say that 
the SEM model is reliable. 
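
The random split can be sketched in Python as follows. The even 50/50 split within each group, the fixed seed, and the integer candidate IDs are assumptions for illustration; the text above does not report the split proportions or procedure.

```python
import random


def split_calibration_validation(candidate_ids, seed=2024):
    """Randomly split IDs into two halves: (calibration, validation)."""
    ids = list(candidate_ids)
    rng = random.Random(seed)  # fixed seed (an assumption) makes the split reproducible
    rng.shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]


# Group sizes are taken from Table 1; the ID scheme is hypothetical.
group_sizes = {"E": 10908, "EA": 6140, "SA": 3520}
splits = {
    group: split_calibration_validation(range(n))
    for group, n in group_sizes.items()
}
for group, (calibration, validation) in splits.items():
    print(group, len(calibration), len(validation))
```

Splitting within each group keeps the E, EA, and SA proportions the same in both samples.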

The NNAAP model (Figure 1) is a structural model with three endogenous latent variables 
as first-order factors. The model represents the 1995 NNAAP job analysis content outline 
(National Council of State Boards of Nursing, 1995). Nurse aide ability was defined as a 
person's grasp of the basic knowledge and skills necessary to provide care to patients as a nurse 
aide, within regulatory guidelines. The first endogenous variable was measured by three subtests, 
the second by two subtests, and the third by four subtests. 
The three endogenous variables matched the three first-level content areas of the outline. 
The nine subtests (observed variables) formed the second content level. It was assumed that each 
group's subtests measured a unique aspect of the NNAAP test. The single exogenous variable 
was a second-order factor called competence or ability. Candidate ability was hypothesized to 
account for all variance and covariance among the first-order factors. For identification 
purposes, all three first-order factor variances were set equal, and the first factor loading of each 
of the three endogenous variables and the variance of ability were fixed at 1.0. 
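
In conventional SEM notation, the hypothesized second-order model could be written roughly as below. All symbols are assumptions introduced for illustration and do not appear in the original: y_{jk} is the k-th subtest of content area j, \eta_j the first-order factor for content area j, and \xi candidate ability. Reading "first-order factor variances" as the residual (disturbance) variances of the first-order factors is likewise an interpretation.

```latex
\begin{align}
  y_{jk} &= \lambda_{jk}\,\eta_j + \varepsilon_{jk},
      \qquad j = 1, 2, 3;\quad k = 1, \dots, K_j;\quad (K_1, K_2, K_3) = (3, 2, 4), \\
  \eta_j &= \gamma_j\,\xi + \zeta_j, \\
  \lambda_{j1} &= 1, \qquad \operatorname{Var}(\xi) = 1, \qquad
  \operatorname{Var}(\zeta_1) = \operatorname{Var}(\zeta_2) = \operatorname{Var}(\zeta_3).
\end{align}
```

The third line collects the identification constraints described in the text.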

In order to evaluate the adequacy of the NNAAP factor model to fully account for the 
relationships among subtests, a series of SEM analyses with maximum likelihood estimation was 
conducted on the calibration sample for each language and administration group. Once the 
fit of the model for each calibration sample was determined, the invariance of the model 
structure for the validation samples was investigated across dual languages (English and 
Spanish) and administration conditions (with or without audio assistant). All tests of invariance 
began with a global test of the equality of covariance structures across groups (Jöreskog, 1971b). 
The data for all groups were analyzed simultaneously to obtain efficient estimates (Bentler, 
1995). Then, a series of nested constraints was applied equally to the same parameters across the 
E, EA, and SA groups in order to detect configuration and factor pattern differences across 
groups. The constraints used include, from weaker to stronger, (1) model structure, (2) model 
structure and factor loadings, and (3) model structure, factor loadings, and unique variance. 
Changes in goodness-of-fit statistics were examined to detect differences in structural parameters. 
Several well-known goodness-of-fit indexes were used to evaluate model fit: the chi-square (χ²), 
the comparative fit index (CFI), the unadjusted goodness-of-fit index (GFI), the normed fit index 
(NFI), the Tucker-Lewis index (TLI), the root mean square error of approximation (RMSEA), 
and the standardized root mean square residual (SRMR). All analyses were conducted 
using AMOS 4.0 (Arbuckle & Wothke, 1999). For the group comparisons with increased 
constraints, the χ² difference (Δχ²) provides the basis of comparison with the previously fitted 
model. However, a significant Δχ² does not necessarily indicate a departure from invariance when 
the sample size is large, because the chi-square statistic increases with sample size and will detect 
even minute differences between the hypothesized model and the data (Bollen & Long, 1993; 
Browne & Cudeck, 1993). 
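
As a rough illustration of how several of these indexes relate to the chi-square statistic, the following Python sketch applies common textbook formulas for RMSEA, CFI, TLI, and NFI. It is not the AMOS implementation, and multi-group software may apply an additional correction to RMSEA. The null (independence) model values and the sample size in the example are hypothetical; only the model chi-square and df come from Table 3 (E + EA baseline, calibration sample).

```python
import math


def fit_indexes(chi2_model, df_model, chi2_null, df_null, n_total):
    """Return RMSEA, CFI, TLI, and NFI computed from chi-square statistics."""
    d_model = max(chi2_model - df_model, 0.0)  # estimated noncentrality, fitted model
    d_null = max(chi2_null - df_null, 0.0)     # estimated noncentrality, null model
    rmsea = math.sqrt(d_model / (df_model * (n_total - 1)))
    cfi = 1.0 - d_model / max(d_model, d_null, 1e-12)
    ratio_null = chi2_null / df_null
    ratio_model = chi2_model / df_model
    tli = (ratio_null - ratio_model) / (ratio_null - 1.0)
    nfi = (chi2_null - chi2_model) / chi2_null
    return {"RMSEA": rmsea, "CFI": cfi, "TLI": tli, "NFI": nfi}


# chi2_model and df_model: E + EA baseline, calibration sample (Table 3).
# chi2_null, df_null, and n_total are hypothetical values for illustration only.
example = fit_indexes(chi2_model=175.83, df_model=50,
                      chi2_null=15000.0, df_null=72, n_total=8524)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

The sketch shows why incremental indexes such as CFI and TLI are less sensitive to sample size than the raw chi-square: they compare the fitted model with a null model estimated on the same sample.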

Figure 1. The NNAAP SEM Model of EA Test for Calibration Sample 

Results 

Evaluation of Model Fit Across Calibration and Validation Samples 

Table 2 shows the fit indexes of the NNAAP model in the cross-validation samples for different 
languages and administrations. Hu and Bentler (1999) recommend using combinations of 
goodness-of-fit indexes to obtain a robust evaluation of model fit. The criterion values they list 
for a model with a good fit are CFI > 0.95, TLI > 0.95, RMSEA < 0.06, and SRMR < 0.08. For this 
model, all values satisfy the Hu and Bentler criteria for these four fit statistics. The chi-squares 
are significant because the sample size is large. The values of GFI, AGFI, and NFI 
also support the fit of the model for all groups. All factor loadings are reasonable and statistically 
significant. Overall, the model provides a reasonably close fit to the data 
and is cross-validated. 
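
The four cutoffs quoted above can be applied mechanically, as in this Python sketch. The function and variable names are hypothetical. Because Table 2 is not reproduced here, the example rows are taken from the calibration sample in Table 3.

```python
# Cutoffs as quoted in the text from Hu and Bentler (1999).
HU_BENTLER_CUTOFFS = {
    "CFI": (">", 0.95),
    "TLI": (">", 0.95),
    "RMSEA": ("<", 0.06),
    "SRMR": ("<", 0.08),
}


def meets_hu_bentler(indexes):
    """Return a dict mapping each index name to True if it meets its cutoff."""
    results = {}
    for name, (direction, cutoff) in HU_BENTLER_CUTOFFS.items():
        value = indexes[name]
        results[name] = value > cutoff if direction == ">" else value < cutoff
    return results


# Rows from Table 3, calibration sample.
e_ea_baseline = {"CFI": 0.99, "TLI": 0.99, "RMSEA": 0.02, "SRMR": 0.01}
e_sa_constraint_3 = {"CFI": 0.91, "TLI": 0.91, "RMSEA": 0.05, "SRMR": 0.09}

print(meets_hu_bentler(e_ea_baseline))
print(meets_hu_bentler(e_sa_constraint_3))
```

Reporting the result per index, rather than a single pass/fail flag, makes it easy to see which indexes fail when constraints are added.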

Evaluation of Equivalence Across Language 

The goodness-of-fit indexes across languages in a nested series of tests are presented in 
Table 3. Because both the EA and SA groups, in both the calibration and validation samples, used 
audio tapes, the only factor that differs between the two groups is language (English versus 
Spanish). For each constraint condition, the specified parameters were constrained to be equal 
across the EA and SA groups. For the calibration sample, the χ² difference 
between the EA + SA baseline and the Constraint I nested models is not 
statistically significant at the 0.05 level even given the large sample size. The χ² differences 
between Constraints II and I and between Constraints III and II are significant, as expected for 
such large sample sizes. All other fit indexes are well within the Hu and Bentler (1999) criteria 
except for NFI and TLI for Constraints II and III, and CFI for Constraint III. For the validation 
sample, GFI, NFI, TLI, RMSEA, and SRMR are all within the Hu and Bentler criteria. This 
suggests that the factor structure, latent variances, and factor loadings of the NNAAP are the 
same for English and Spanish speakers. However, the unexplained unique variances are still 
likely to vary across languages. 
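
The nested-model comparisons described above can be reproduced from the χ² and df columns of Table 3 with a chi-square difference test. The Python sketch below uses only the standard library; the helper names are hypothetical, and the survival function is a generic numerical routine (series and continued-fraction evaluation of the incomplete gamma function), not the procedure used by AMOS.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    log_prefactor = a * math.log(z) - z - math.lgamma(a)
    if z < a + 1.0:
        # Series expansion of the lower regularized incomplete gamma function.
        term = 1.0 / a
        total = term
        n = 1
        while abs(term) > 1e-16 * abs(total):
            term *= z / (a + n)
            total += term
            n += 1
        return 1.0 - total * math.exp(log_prefactor)
    # Continued fraction (modified Lentz method) for the upper function.
    tiny = 1e-300
    b = z + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, 10000):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        d = tiny if abs(d) < tiny else d
        c = b + an / c
        c = tiny if abs(c) < tiny else c
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return h * math.exp(log_prefactor)


def chi2_difference_test(chi2_restricted, df_restricted, chi2_baseline, df_baseline):
    """Return (delta chi-square, delta df, p-value) for two nested models."""
    delta_chi2 = chi2_restricted - chi2_baseline
    delta_df = df_restricted - df_baseline
    return delta_chi2, delta_df, chi2_sf(delta_chi2, delta_df)


# EA + SA rows, calibration sample, from Table 3: (label, chi-square, df).
models = [("Baseline", 142.35, 50), ("Constraint I", 143.59, 53),
          ("Constraint II", 409.38, 62), ("Constraint III", 680.85, 71)]
for (_, chi_prev, df_prev), (label, chi_cur, df_cur) in zip(models, models[1:]):
    d_chi2, d_df, p = chi2_difference_test(chi_cur, df_cur, chi_prev, df_prev)
    print(f"{label}: delta chi2 = {d_chi2:.2f}, delta df = {d_df}, p = {p:.4f}")
```

For these rows, the step from the baseline to Constraint I gives a difference of 1.24 on 3 degrees of freedom, which is not significant at the 0.05 level, while the later steps give much larger differences.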

Table 3 

Goodness-of-Fit Indexes of Invariance of Model Constraints* Across Groups Based on 
Calibration and Validation Samples 

Sample/Group              df        χ²    GFI   NFI   TLI   CFI  RMSEA  SRMR

Calibration Sample
E + EA Baseline           50    175.83   1.00   .99   .99   .99    .02   .01
  Constraint I            53    176.74    .99   .99   .99   .99    .02   .01
  Constraint II           62    253.87    .99   .98   .99   .99    .02   .03
  Constraint III          71   1752.59    .95   .89   .90   .90    .05   .03
EA + SA Baseline          50    142.35    .99   .98   .98   .99    .02   .01
  Constraint I            53    143.59    .99   .98   .98   .99    .02   .02
  Constraint II           62    409.38    .98   .94   .94   .95    .03   .02
  Constraint III          71    680.85    .97   .91   .91   .91    .04   .06
E + SA Baseline           50    183.48    .99   .99   .99   .99    .02   .01
  Constraint I            53    189.15    .99   .99   .99   .99    .02   .02
  Constraint II           62    462.36    .99   .97   .97   .97    .03   .07
  Constraint III          71   1229.09    .96   .91   .91   .91    .05   .09
E + EA + SA Baseline      74    234.12    .99   .99   .99   .99    .01   .01
  Constraint I            80    256.76    .99   .99   .99   .99    .01   .02
  Constraint II           98    624.91    .99   .97   .97   .97    .02   .07
  Constraint III         116   2491.02    .95   .87   .88   .87    .04   .09

Validation Sample
E + EA Baseline           50    163.03   1.00   .99   .99   .99    .02   .02
  Constraint I            53    163.39   1.00   .99   .99   .99    .02   .02
  Constraint II           62    289.76    .99   .98   .98   .99    .02   .04
  Constraint III          71   1710.53    .95   .90   .90   .91    .05   .03
EA + SA Baseline          50    339.23    .99   .97   .97   .98    .03   .02
  Constraint I            53    341.16    .99   .97   .97   .98    .03   .03
  Constraint II           62    425.55    .99   .97   .97   .97    .03   .03
  Constraint III          71    514.49    .99   .96   .96   .96    .03   .03
E + SA Baseline           50    325.14    .99   .98   .98   .99    .02   .03
  Constraint I            53    326.50    .99   .98   .98   .99    .02   .02
  Constraint II           62    551.15    .99   .97   .97   .97    .03   .04
  Constraint III          71   2405.79    .95   .87   .88   .88    .06   .05
E + EA + SA Baseline      74     413.7    .99   .98   .98   .99    .02   .02
  Constraint I            80    415.90    .99   .98   .98   .99    .02   .03
  Constraint II           98    723.95    .99   .97   .97   .97    .02   .04
  Constraint III         116   2990.66    .95   .88   .89   .88    .04   .05

* The levels of model constraints restricted to be equal across language (or administration) are: 

I. Model structure and latent variable variance. 

II. Model structure, latent variable variance, and factor loading. 

III. Model structure, latent variable variance, factor loading, and unique variance. 

Evaluation of Equivalence Across Administration Conditions 

Also from Table 3, for both the calibration and validation samples, the goodness-of-fit results 
for the E + EA groups show whether the administration condition affects the equivalence of the 
factor structure of the NNAAP, because both groups used the same language (English) and the 
language factor therefore cancels out. For both the calibration and validation samples, the χ² 
differences between the E + EA baseline and the Constraint I nested models are not statistically 
significant at the 0.05 level even given the large size of the samples. The χ² differences between 
Constraints II and I and between Constraints III and II are significant, as expected for such large 
sample sizes. All other fit indexes are well within the Hu and Bentler (1999) criteria except NFI, 
TLI, and CFI for Constraint III. This suggests that the factor structure, latent variances, and 
factor loadings of the NNAAP are the same for English and English with audio assistant. However, 
the unexplained unique variances are still likely to vary across administration conditions. 

Evaluation of Equivalence Across Language and Administration Condition 

The goodness-of-fit results for the E + SA and E + EA + SA groups for both the calibration 
and validation samples are shown in Table 3. Both comparisons mix language and 
administration conditions. Any differences between nested models across language and 
administration condition could be due to the language factor, the administration 
condition factor, or both. For both the calibration and validation samples, the χ² differences 
between the E + SA baseline and the Constraint I nested models are not statistically 
significant at the 0.05 level even given the large size of the samples. The χ² differences 
between Constraints II and I and between Constraints III and II are significant, as expected for 
such large sample sizes. All other fit indexes are well within the Hu and Bentler (1999) criteria 
except NFI, TLI, CFI, and SRMR for Constraint III. The χ² differences between the E + EA + SA 
baseline and the Constraint I nested models are not statistically significant at the 0.05 level even 
given the large size of the samples. The χ² differences between Constraints II and I and between 
Constraints III and II are significant, as expected for such large sample sizes. All other fit 
indexes are well within the Hu and Bentler (1999) criteria except NFI, TLI, CFI, RMSEA, and 
SRMR for Constraint III. This suggests that the factor structure, latent variances, and factor 
loadings of the NNAAP are the same for the English, English with audio assistant, and Spanish 
with audio groups. However, the unexplained unique variances are still likely to vary across 
administration conditions. 

Practical Implication 

The present study examined the comparability of NNAAP scores across language and 
administration condition groups for calibration and validation samples that were randomly drawn 
from the same population. Results show that the factorial (structural) validity of the NNAAP is 
well supported. Statistically significant χ² (or χ² difference) statistics occur because of the large 
sample sizes. For this reason, it is frequently appropriate to conclude that a SEM model fits the 
data even if χ² is significant (Jöreskog & Sörbom, 1989; Mulaik, James, Van Alstine, Bennett, 
Lind, & Stilwell, 1989). The values of all other fit statistics (CFI, AGFI, NFI, TLI, RMSEA, and 
SRMR) fall within the bounds of Hu and Bentler's (1999) criteria. Thus, the overall pattern of fit 
statistics for the NNAAP data indicates a reasonable fit, even though the chi-square test tends to 
reject factor models when sample sizes are large. The evidence of fit holds for both the 
calibration and validation samples for the language and administration condition groups. Further 
evidence of the invariance of the factor structure of the NNAAP scores across language and 
administration groups is found in all fit statistics when model structure, factor loadings, and latent 
variable variances, but not unique variances, are constrained to be equal across groups. Thus, the 
data suggest that this construct is similarly structured (fair) across different language and 
administration condition groups. 

In summary, this study underscores the importance of empirical validation of certification 
exams and provides evidence supporting the validity and fairness of a widely used national 
exam. It carries the validation process beyond the content-related evidence (job analysis) that 
often serves as the sole documented support of validity for credentialing exams. By publicizing 
the results of this study, we hope to encourage the credentialing community to strengthen the 
validity of its exams by investigating their factor structure and making modifications, if 
warranted, to ensure that the same constructs are measured regardless of language and 
administration condition. We also hope to encourage the practice of providing evidence of 
validity from a variety of sources, thus strengthening the defensibility of licensure and 
certification exams across the board. 